Impact of System and Cache Bandwidth on Stencil Computations Across Multiple Processor Generations
Abstract
We compare older single-core multi-processor systems with multi-core processors and ask which hardware improvements matter most for the performance of stencil computations. Even before the multi-core era began, the bandwidth wall, that is, the discrepancy between off-chip bandwidth requirements and the bandwidth the system delivers, was already a significant problem. As the number of parallel cores in CPUs has grown, this discrepancy could be kept from worsening only by introducing dual-, triple- and quad-channel memory interfaces. However, this type of off-chip bandwidth scaling is too expensive, so it offers only temporary relief and cannot keep up indefinitely with the exponentially growing number of cores. Therefore, we analyze in particular how the scaling of system and cache bandwidth affects the performance of stencil computations. We evaluate both naive stencil implementations and time skewing variants, which exploit temporal locality and minimize the number of cache misses in iterative stencil computations. We prove certain invariance properties of the schemes and develop a corresponding performance model. We then use this model to determine which hardware improvements the old single-core processors would need in order to match the performance of the new multi-core processors. From this we draw conclusions about the most effective improvements for future processors.
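To make the contrast between a naive sweep and time skewing concrete, here is a minimal sketch in Python. It is not the scheme evaluated in the paper: it assumes a one-dimensional 3-point averaging stencil with fixed end values, and the function names `naive_jacobi` and `time_skewed_jacobi` as well as the `tile` parameter are hypothetical. The naive version streams the whole array once per time step, while the skewed version advances one small tile through all time steps before moving on, shifting the tile one cell to the left per step so that every value it reads has already been computed.

```python
def naive_jacobi(u, steps):
    """Apply `steps` full sweeps of a 3-point average; end values stay fixed."""
    n = len(u)
    src, dst = list(u), list(u)
    for _ in range(steps):
        # One sweep touches the entire array before the next time step starts.
        for i in range(1, n - 1):
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
        src, dst = dst, src
    return src


def time_skewed_jacobi(u, steps, tile):
    """Same computation, reordered into skewed space-time tiles (illustrative)."""
    n = len(u)
    # Two buffers, selected by time-step parity; both carry the fixed end values.
    bufs = [list(u), list(u)]
    # Enough tiles so that the left-shifted windows still cover index n - 2
    # at the last time step.
    num_tiles = (n + steps) // tile + 1
    for k in range(num_tiles):
        # Advance this tile through all time steps while its data is still hot.
        for t in range(1, steps + 1):
            src, dst = bufs[(t - 1) % 2], bufs[t % 2]
            # The window slides one cell left per time step (the "skew").
            lo = max(1, k * tile - t)
            hi = min(n - 1, (k + 1) * tile - t)
            for i in range(lo, hi):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return bufs[steps % 2]


if __name__ == "__main__":
    data = [float((7 * i) % 11) for i in range(50)]
    # Both orderings perform identical arithmetic per point.
    print(naive_jacobi(data, 8) == time_skewed_jacobi(data, 8, 6))
```

Both functions perform the same arithmetic for every grid point and differ only in traversal order. In a compiled implementation, the skewed order lets a tile that fits in cache be reused across all time steps instead of being reloaded from main memory on every sweep; pure Python will not show that speedup, so the sketch only illustrates the reordering.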
Related resources
Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance mod...
A 3D-Stacked Memory Manycore Stencil Accelerator System
Stencil operations are an important class of scientific computational kernels that are pervasive in scientific simulations as well as in image processing. A key characteristic of this class of computation is that they have a low operational intensity, i.e., the ratio of the number of memory accesses to the number of floating point operations it performs is high. As a result, the performance of ...
An Auto-tuning JIT Compiler for Accelerating Multiple Stencil Computations
We present a JIT compiler with auto-tuning capabilities fusing multiple stencil computations. Data arrays for scientific computing of image processing often exceed cache-memory size. To take advantage of spatial and temporal locality, a common method is to partition the images into tiling blocks for multicore architectures. In realistic scenarios, the multiple image algorithms, most of which ar...
Overcoming Bandwidth Limitations in Visual Computing
Because visual computations are very data intensive they are often limited by the bandwidth of the system rather than its peak computational performance. The trend towards many-core architectures exacerbates the problem because the parallel cores let the compute capability grow exponentially while the system bandwidth increases only linearly. At the core of the bandwidth problem in visual compu...
GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs
Spatial blocking is a critical memory-access optimization to efficiently exploit the computing resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data over multiple spatial iterations, spatial blocking can significantly lessen the pressure of accessing slow global memory. Stencil computations, for example, can exploit such data reuse via spatial blocking through t...